MINING PATTERNS AND INSIGHTS FROM AIS DATA


We are living in the age of Big Data. In other words, we are deluged with data. Whether you embrace it or not, this change profoundly affects all of us. Because this situation is unprecedented, there are no historical figures who can offer us wisdom on how to navigate it. To be honest, part of me feels overwhelmed by the sheer amount of data inundating us every day and wants to avoid it at all costs, but another part of me insists that it is our responsibility to embrace this change to better understand the world we live in.

In this project, I am working with a big dataset known as Automatic Identification System (AIS) data. Modern vessels are equipped with AIS transponders, enabling them to communicate constantly with ground stations and satellites. This communication allows us to collect real-time location information for vessels. Establishing a workflow to visualize AIS data in an automated way therefore gives us an edge over traditional sailors, who relied solely on experiential knowledge.

The project I am introducing here was initially started at the request of the Korean Navy for the same reason I just mentioned. They had been relying on a somewhat primitive method and were seeking a visualization application that could present AIS data easily and in a more aesthetically pleasing manner. The project took a year to complete from start to finish. Throughout that period, I encountered bits of information here and there that could be used to visualize AIS data, but there was no single resource that synthesized all of it. This article aims to fill that gap by comprehensively introducing the various methodological approaches I employed to develop the application.


Project Outline

For those who may not be familiar with the main subject of this project, I will begin by briefly discussing what AIS is and providing an overview of the dataset. Subsequently, I will introduce the visualization techniques I have employed for this project. The analysis methods used can be divided into two main categories: point pattern analysis and line density analysis. I will first delve into the details of the former before transitioning into the latter. The outline of this project is as follows:


Setting Up a Working Environment

Initially, I set up a working environment by importing essential libraries and configuring a working directory. Throughout this project, a wide range of analyses were conducted, resulting in the use of 25 distinct R libraries. If you expand the code tab below, you can see the complete list of these libraries. The dataset imported in this section has been pre-processed using a tool I developed. Further details about this tool will be elaborated on in the later sections.

library(rgdal)
library(dplyr)
library(RSQLite)
library(sf)
library(ggplot2)
library(ggblend)
library(viridis)
library(ggmap)
library(gganimate)
library(osmdata)
library(RColorBrewer)
library(ggpubr)
library(patchwork)
library(tmap)
library(units)
library(gifski)
library(DiagrammeR)
library(tidyverse)
library(rnaturalearth)
library(RMariaDB)
library(plotly)
library(move)
library(moveVis)
library(mapview)
library(Hmisc)

setwd("/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization")

Dec_01_NEW_DATA <- read.csv("Dec_01_Cleaned.csv")

What is AIS data?

In this project, I have utilized a dataset known as AIS, which stands for Automatic Identification System. AIS serves as the international standard system for communication between vessels, ground stations, and satellites. Initially developed for military purposes, the system was later adopted by the International Maritime Organization (IMO). Presently, large vessels exceeding 300 gross tons (GT) are mandated to transmit their location information to prevent potential collisions.

AIS data consists of both static and dynamic information. The static information includes the Maritime Mobile Service Identity (MMSI), ship name, ship type, and other vessel-specific details. The dynamic information, on the other hand, includes the vessel’s current location, timestamp, speed, and course. Table 1 presents the variables utilized in this project along with brief descriptions. Table 2 displays the initial rows of the AIS data, providing a preview of the data format.


Field Name Description
MMSI Maritime Mobile Service Identity
Timestamp The time the AIS message was transmitted
Longitude Vessel longitude in degrees
Latitude Vessel latitude in degrees
Speed Speed over ground in knots
Course Course over ground in degrees
Name Vessel name
Ship type Vessel ship and cargo type code
Table 1. AIS data field description


timestamp mmsi latitude longitude speed course shiptype name
2022-12-01 07:55:43 636022401 42.761 132.711 0.3 50 80 ATLANTICA
2022-12-01 11:21:20 536179300 35.513 129.395 0.0 220.3 70 GRACE
2022-12-01 08:35:23 636012808 34.089 128.462 1.4 127.8 70 MSC
Table 2. AIS data sample


Ship type No. of positions Percentage(%)
Cargo 232,878 28.08%
Tanker 85,841 10.35%
Fishing 74,568 8.99%
Tug 41,853 5.05%
Passenger 14,740 1.78%
Pilot vessel 6,022 0.73%
Other 149,849 18.07%
Table 3. AIS sample summary by type


Table 3 provides a brief summary of the percentage distribution of the AIS data. The Cargo and Tanker categories collectively represent around 38% of the dataset. The ‘Other’ category, which lacks identifiable static information, accounts for approximately 18%, whereas the Fishing, Tug, and Passenger categories combined make up roughly 16% of the data.


point_map <- function(df, xmin=121, xmax=136, ymin=31, ymax=43) {
  
  south <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "SouthKorea")
  north <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "NorthKorea")
  
  ggplot(df) + geom_sf(data=south) + geom_sf(data=north) +
    geom_point(mapping = aes(x=LONGITUDE, y=LATITUDE, colour=SHIPTYPE), size=0.5) +
    xlab(NULL) + ylab(NULL) +
    xlim(xmin, xmax) + ylim(ymin, ymax)  #the function arguments control the map extent
  
}

point_map(Dec_01_NEW_DATA) 

The figure above displays the AIS data around the Korean peninsula as of December 1st, 2022, depicting approximately 1,440,000 data points collected in just a single day. While the initial observation reveals a high volume of data points, it is challenging to extract any meaningful insights directly from them. In the following discussion, I will explore diverse visualization methods to unearth valuable insights and patterns from this seemingly complex dataset. The investigation into this rich dataset exemplifies the essence of ‘data mining,’ a process that reveals hidden treasures within data. By delving into the details, you will understand why this field is aptly named ‘data mining.’

Point Pattern Analysis Workflow

library(DiagrammeR)

DiagrammeR::grViz("               
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 

  # graph statement
  graph [layout = dot,
         rankdir = LR,            # layout top-to-bottom
         fontsize = 10]

  # nodes (circles)
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]                      

  # Main tree
  Original  [label = 'Original\nPoint Data'] 
  Hexagon [label = 'Create hexagon\n(e.g.,20km)'] 
  Count  [label = 'Count points\nin polygon'] 
  Spatial_concentration [label = 'Concentration\nVisualization']
  Raster_surface [label = 'Kernel Density\nEstimation']
  Heat_map [label = 'Heat map', shape =  square, fontcolor=blue, color=blue]
  Isopleth [label = 'Isopleth map', fontcolor = darkgreen, shape = square,
  color=darkgreen]
  
  #Branch1 : Creating 3D and animation
  SplitMMSI [label = 'Split by\nMMSI']
  OrderTime [label = 'Order by\ntimestamp']
  MovingObject [label = 'Create\nmoving object']
  ThreeD [label = '3D time-space\ntrajectory\nvisualization', fontcolor = darkgreen,
  shape = square, color=darkgreen]
  Animation [label = 'Animation\n(e.g.,gif, mp4)', fontcolor = darkgreen,
  shape=square, color=darkgreen]

  #Branch2: Spatial autocorrelation
  Choropleth [label = 'Choropleth\nmap']
  Auto_cor  [label = 'Anselin\nLocal Morans I', 
  shape=square, color=orange, fontcolor=orange] 

  # edges
  Original -> {Spatial_concentration Hexagon SplitMMSI}
  Spatial_concentration -> Raster_surface
  Raster_surface -> {Isopleth Heat_map}
  Hexagon -> Count                      
  Count -> Choropleth 
  
  ## Branch1: 3D and animation                        
  SplitMMSI -> OrderTime
  OrderTime -> MovingObject[label = ' linear\n  interpolation', fontcolor=red]
  MovingObject -> ThreeD [style = dashed, color = darkgreen]
  MovingObject -> Animation [style = dashed, color = darkgreen]

  ## Branch2: Spatial Autocorrelation
  Choropleth -> Auto_cor[label = 'Statistical Validation\nof Spatial Clustering']
 }
")

Kernel Density Estimation (KDE) & Isopleth Mapping

When analyzing point distribution patterns in geographic space, we typically assume that these patterns are generated by an unknown distribution f(x) rather than by a Gaussian distribution. In this study, Kernel Density Estimation (KDE) was utilized to estimate the density distribution of AIS data. This method computes a continuous probability density distribution using a kernel density function defined as:

\(\hat{f}(x,y)\) = \(\displaystyle \frac{1}{nh_{x}h_{y}}\sum_{i=1}^{n}k\left[\frac{x-x_{i}}{h_{x}},\frac{y-y_{i}}{h_{y}}\right]\)

The equation shown above first generates kernel functions \(k\left[\frac{x-x_{i}}{h_{x}},\frac{y-y_{i}}{h_{y}}\right]\) around the observed data points. The sum of all these kernel functions is then divided by the total number of data points, resulting in a probability density function. Within the equation, ‘\(x\)’ and ‘\(y\)’ represent random variables, while ‘\(x_{i}\)’ and ‘\(y_{i}\)’ denote the observed data points. The parameter ‘\(h\)’ signifies the bandwidth and is used to smooth the probability density function. Smaller bandwidths tend to produce a spikier distribution, whereas larger bandwidths tend to produce a flatter one.

One of the most challenging and crucial tasks in kernel density estimation is determining the appropriate bandwidths (\(h_{x}\), \(h_{y}\)) for a given dataset. In this project, the method proposed by Bowman and Azzalini (1997) was selected for its simplicity and ease of application. The examples presented here were generated using a different algorithm in R; however, the chosen method was employed in the actual project. The mathematical equation for this approach is as follows:

\(\displaystyle h_{x} = \sigma_{x}(\frac{2}{3n})^{\frac{1}{6}}\) \(\displaystyle h_{y} = \sigma_{y}(\frac{2}{3n})^{\frac{1}{6}}\)

In the equation, ‘\(\sigma_{x}\)’ and ‘\(\sigma_{y}\)’ represent the standard deviations of x and y, whereas n denotes the total number of observations in the dataset.
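To make these formulas concrete, here is a minimal base-R sketch (illustrative only, not the project code): it computes the Bowman and Azzalini bandwidths for a simulated point cloud and then evaluates the KDE equation directly at one location, using a Gaussian product kernel for \(k\). All coordinates are made up.

```r
set.seed(42)
x <- rnorm(500, mean = 127, sd = 1)   #hypothetical longitudes
y <- rnorm(500, mean = 35,  sd = 1)   #hypothetical latitudes
n <- length(x)

#h = sigma * (2 / (3n))^(1/6), applied per dimension
h_x <- sd(x) * (2 / (3 * n))^(1/6)
h_y <- sd(y) * (2 / (3 * n))^(1/6)

#f_hat(x, y) = 1/(n h_x h_y) * sum_i k((x - x_i)/h_x, (y - y_i)/h_y)
kde2d_at <- function(px, py) {
  sum(dnorm((px - x) / h_x) * dnorm((py - y) / h_y)) / (n * h_x * h_y)
}

kde2d_at(127, 35)   #density near the centre of the simulated point cloud
```

Evaluating this estimator over a grid of locations is what yields the continuous density surface used for mapping.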

contour_map <- function(df, xmin=121, xmax=136, ymin=31, ymax=43) {
  
  south <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "SouthKorea")
  north <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "NorthKorea")
  
  df <- df %>% filter(LONGITUDE > xmin & LONGITUDE < xmax & LATITUDE > ymin & LATITUDE < ymax)
  
  ggplot(data = df) + geom_sf(data=south) + geom_sf(data=north) +
    geom_density_2d(mapping = aes(x=LONGITUDE, y=LATITUDE), alpha=0.7, lwd=0.1) +
    geom_density_2d_filled(mapping = aes(x=LONGITUDE, y=LATITUDE), alpha=0.7) +
    xlim(xmin, xmax) + ylim(ymin, ymax) +
    theme(legend.position = "none")
  
}

point_map(Dec_01_NEW_DATA, xmin=125, xmax=128, ymin=32.5, ymax=34 ) # point map around Jeju Island
contour_map(Dec_01_NEW_DATA, xmin=125, xmax=128, ymin=32.5, ymax=34) # KDE & contour map around Jeju Island

In this visualization, both the point map and KDE (Kernel Density Estimation) map are presented simultaneously. The point map renders AIS data as discrete points, while KDE creates a continuous raster surface. In general, the point map method struggles with identifying clear clustering patterns, especially as the number of points increases. In contrast, the KDE technique effectively reveals the intrinsic clustering patterns within the dataset, regardless of the number of points present.

Additionally, leveraging the KDE output to generate contour lines introduces another interesting way of visualizing point density patterns. Since KDE produces a continuous raster surface, iso-lines can be created by connecting pixels of identical estimated values, yielding a contour map (or isopleth map) of the AIS data. As the figure above clearly shows, overlaying the contour map on the KDE surface significantly enhances the visibility of point clustering patterns. The final composite map highlights the northern and western regions of Jeju Island, and the corridor connecting them, as the most dominant hotspots of AIS traffic.

Creating a Choropleth Map

choropleth_map <- function(df, cell_size=50, square=FALSE, interactive = FALSE) {
  
  #Reading basemap layers used in the static plot
  south <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "SouthKorea")
  north <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "NorthKorea")
  
  ais_df_test <- df %>% filter(!is.na(LONGITUDE) & !is.na(LATITUDE)) %>%
    st_as_sf(coords = c("LONGITUDE","LATITUDE"), crs = 4326 , remove = FALSE)
  
  #converting the AIS data frame into an sf object and transforming its projection
  ais_df_test <- ais_df_test %>% st_transform(3857)
  
  #Creating a hex grid; EPSG:3857 is in meters, so cell_size*1000 converts km to m
  grd <- st_make_grid(ais_df_test, cell_size*1000, square=square, flat_topped = TRUE)
  
  # To sf and add grid ID
  fishnet_grid_sf = st_sf(grd) %>%
    # add grid ID
    mutate(grid_id = seq_along(grd))
  
  # intersect the fishnet grid with the AIS point data, then count the points per cell
  fishnet_grid_sf$n_colli = lengths(st_intersects(fishnet_grid_sf, ais_df_test))
  
  # remove grid cells whose values are 0 (i.e., no points inside that cell)
  fishnet_count = filter(fishnet_grid_sf, n_colli > 0)
  
  if(interactive == FALSE){
    
    ggplot() +
      geom_sf(data=fishnet_count, aes(fill=n_colli), color="gray", lwd=0.1) +
      theme_bw() + scale_fill_viridis_c(option="D", alpha=0.5, direction=-1) +
      geom_sf(data=south) + geom_sf(data=north)
    
  } else if (interactive == TRUE) {
    
    tmap_mode("view")
    
    map_fishnet = tm_shape(fishnet_count) +
      tm_fill(
        col = "n_colli",
        palette = "YlOrRd",
        style = "cont",
        title = "Number of AIS points",
        id = "grid_id",
        showNA = FALSE,
        alpha = 0.5,
        popup.vars = c("Number of AIS: " = "n_colli"),
        popup.format = list(n_colli = list(format = "f", digits = 0))) +
      tm_borders(col = "grey40", lwd = 0.1)
    
    map_fishnet
  }
}

size_30_km <- choropleth_map(Dec_01_NEW_DATA, cell_size=30, square=FALSE, interactive=FALSE) # interactive=FALSE
size_50_km <- choropleth_map(Dec_01_NEW_DATA, cell_size=50, square=FALSE, interactive=FALSE) # interactive=FALSE

choropleth_combined <- size_30_km + size_50_km   # patchwork: combine the two plots side by side

Choropleth mapping was also employed as an additional way to visualize AIS data. While this method shares similarities with the preceding techniques, what distinguishes choropleth mapping is its simplicity and intuitiveness. Although primarily serving as a preliminary step to the Moran’s I analysis in this project, choropleth mapping is a noteworthy tool for visualizing spatial data. To maintain brevity, a detailed discussion of this method is omitted here. Instead, this article focuses on presenting the step-by-step process for creating a choropleth map in R:

  1. Generate a hexagon grid over AIS data points in the study area
    • the figures above used hexagon sizes of 30 km and 50 km, respectively
  2. Use spatial join method to count the number of points within each hexagon
  3. Choose a data classification method (i.e., natural break, standard deviation, equal interval, quantile, etc.)
  4. Determine the number of breaks
  5. Pick a sequential color palette of your choice and visualize the results
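The counting logic in steps 1-2 can be sketched in base R. This is a hedged illustration that bins simulated points into a square grid keyed by floor division, rather than the hexagons built with sf::st_make_grid in the actual workflow; all coordinates and the 0.5-degree cell size are made up.

```r
set.seed(1)
pts <- data.frame(LONGITUDE = runif(1000, 125, 130),   #hypothetical positions
                  LATITUDE  = runif(1000, 33, 38))

cell <- 0.5   #illustrative cell size in degrees

#step 1 analogue: assign each point to a grid cell by integer division
pts$cell_id <- paste(floor(pts$LONGITUDE / cell), floor(pts$LATITUDE / cell), sep = "_")

#step 2 analogue: count points per cell
counts <- as.data.frame(table(cell_id = pts$cell_id), stringsAsFactors = FALSE)
names(counts) <- c("cell_id", "n_points")

head(counts[order(-counts$n_points), ])   #busiest cells first
```

The per-cell counts are what a classification scheme (steps 3-4) and a sequential palette (step 5) would then turn into a choropleth.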

The choropleth map above reveals a higher volume of AIS signals in the vicinity of major ports around South Korea. We can see that the most dominant clusters form around the ports of Busan, Incheon, and Jeju, among others. However, this map alone does not provide conclusive evidence that the observed clustering pattern is statistically significant, nor does it explain the extent to which similar features are clustered or dispersed. This is where Moran’s I analysis proves valuable; its methodology is explored in the subsequent section.

Anselin Local Moran’s I: Cluster and Outlier Analysis

“The first law of geography: Everything is related to everything else, but near things are more related than distant things.”

– Waldo R. Tobler (Tobler 1970)

Global Moran’s I Analysis

Moran’s I analysis is one of the most widely employed methods for evaluating spatial autocorrelation, which assesses how a variable correlates with itself across geographic space. By capturing the extent to which similar features are clustered in space, Moran’s I directly reflects the First Law of Geography: ‘Everything is related to everything else, but near things are more related than distant things.’

The Global Moran’s I can be expressed as:

\(I\) = \(\displaystyle\frac{N}{\sum_{i}(X_{i}-\overline{X})^{2}}\frac{\sum_{i}\sum_{j}w_{ij}(X_{i}-\overline{X})(X_{j}-\overline{X})}{\sum_{i}\sum_{j}w_{ij}}\)

In the equation, \(N\) denotes the total number of regions, and \(X_{i}\) represents the value of interest in region \(i\). \(\overline{X}\) signifies the mean of all values, while \(w_{ij}\) denotes the spatial weight between features \(i\) and \(j\) in the dataset.
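As a sanity check on this equation, the base-R sketch below translates it term by term and applies it to a toy four-region example with a hand-built symmetric weight matrix (illustrative values, not AIS data):

```r
#direct translation of the Global Moran's I equation above
morans_i <- function(x, w) {
  n   <- length(x)
  xc  <- x - mean(x)               #(X_i - X_bar)
  num <- sum(w * outer(xc, xc))    #sum_i sum_j w_ij (X_i - Xbar)(X_j - Xbar)
  (n / sum(xc^2)) * (num / sum(w))
}

x <- c(10, 12, 3, 2)               #regions 1-2 high, regions 3-4 low
w <- rbind(c(0, 1, 0, 0),          #region 1 <-> region 2 are neighbours
           c(1, 0, 0, 0),
           c(0, 0, 0, 1),          #region 3 <-> region 4 are neighbours
           c(0, 0, 1, 0))

morans_i(x, w)                     #strongly positive: similar values adjoin
```

Pairing the two high values and the two low values as neighbours pushes the index toward +1; pairing high with low values instead pushes it toward -1.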

# 1) Turning data frame into sf object with also assigning crs=4326
ais_df_moran <- Dec_01_NEW_DATA %>% filter(!is.na(LONGITUDE) & !is.na(LATITUDE)) %>%
  st_as_sf(coords = c("LONGITUDE","LATITUDE"), crs = 4326 , remove = FALSE)

# 2) re-projecting layer to EPSG 3857
ais_df_moran <- ais_df_moran %>% st_set_crs(4326) %>% st_transform(3857)

#Creating a hex grid; EPSG:3857 is in meters, so the cell size of 50*1000 means 50 km
grd <- st_make_grid(ais_df_moran, 50*1000, square=FALSE, flat_topped = TRUE)

# To sf and add grid ID
fishnet_grid_sf = st_sf(grd) %>%
  # add grid ID
  mutate(grid_id = 1:length(lengths(grd))) 

# intersect fishnet grid and AIS point data, then counting how many points there are per cell
fishnet_grid_sf$n_colli = lengths(st_intersects(fishnet_grid_sf, ais_df_moran)) 
fishnet_grid_sf <- fishnet_grid_sf %>% filter(n_colli > 0)  #keep only cells that contain points
fishnet_grid_sf$n_colli_log = log(fishnet_grid_sf$n_colli) #log-transform

test_poly <- fishnet_grid_sf

library(spdep)

nb <- poly2nb(test_poly, queen=TRUE)
nbw <- nb2listw(nb, style="W", zero.policy = TRUE)

#drawing a neighborhood connection map: neighborhoods based on contiguity
par(mar=c(.5,.5,.5,.5))
plot(st_geometry(test_poly), border = "lightgray")
plot.nb(nb, st_geometry(test_poly) ,add=TRUE, lwd=0.4)

The mathematical equation demonstrates that Moran’s I index is fundamentally a measure of covariation between a value of interest and its spatially lagged mean. However, before the index can be computed, it is necessary to establish a clear definition of what constitutes a neighbor. One method defines a neighbor based on contiguity, while other methods rely on a distance threshold or the number of nearest neighbors. In this project, we have adopted a definition based on contiguity, an example of this is shown in the figure above.

par(mar=c(4.3,4.3,0.5,4.3))
moran.plot(test_poly$n_colli_log, nbw, xlab="AIScount", ylab="AIScount_lagged")

I began by creating a uniform-sized hexagon (e.g., 100 km) over the study area and then counted the number of AIS points for each hexagon using a spatial join method, a robust technique for summarizing data based on spatial relationships. The next step involved computing spatially lagged means (\(AIScount_{lagged}\)) for each hexagon, which were then used to generate a bivariate Moran’s I plot shown above. The plot shows a best-fit line from OLS regression, where the slope represents the Moran’s I coefficient indicating the degree of relationship between \(AIScount\) and \(AIScount_{lagged}\). A preliminary look at this plot suggests a strong positive relationship between the two variables, indicating that higher counts tend to be surrounded by other higher counts, and vice versa for lower counts.

However, this plot alone does not provide enough evidence on whether this observed Moran’s I statistic significantly differs from 0. To further support this statistically, we employed a Monte Carlo approach. Under the the assumption of independence among observations, the Monte Carlo method randomizes spatial patterns by reassigning the attribute values among the hexagons first, and a Moran’s I index is calculated for every permutation. The resulting sampling distribution is then compared with the observed Moran’s I value to test if the assumption of no spatial auto-correlation can be rejected.

gmoranMC <- moran.mc(test_poly$n_colli_log, nbw, nsim = 4999, zero.policy = TRUE)
gmoranMC 
## 
##  Monte-Carlo simulation of Moran I
## 
## data:  test_poly$n_colli_log 
## weights: nbw  
## number of simulations + 1: 5000 
## 
## statistic = 0.7279, observed rank = 5000, p-value = 2e-04
## alternative hypothesis: greater
par(mar=c(4.3,4.3,0.5,4.3))
hist(gmoranMC$res, breaks=300, border="grey", main = NULL, xlab="Moran's I Values via Monte Carlo Simulation", ylab = NULL)
abline(v = gmoranMC$statistic, col = "red")

After running 4,999 Monte Carlo simulations, the results reveal that our observed statistic is extremely unlikely to occur under the assumption of no spatial-autocorrelation (p-vlaue = 0.0002). A p-value of 0.0002 means a mere 0.02% chance of making an error in rejecting the null hypothesis. By the way, the smallest possible p-value one can get from running 4,999 simulations is 1/(4999+1), which is essentially same as what we got from this experiment. In other words, the Monte Carlo simulation in this case did not produce any values that are more extreme than the observed. Consequently, we can confidently assert that our observed clustering pattern is indeed statistically significant.

The Local Indicators of Spatial Association

Though the Global Moran’s I is a valuable tool for examining overall patterns of spatial autocorrelation, this approach is limited because it does not uncover local variations within the dataset. In other words, Global Moran’s I does not indicate where within the dataset local clustering patterns are most pronounced, nor does it highlight areas that deviate significantly from the general trend. This is where the Local Indicators of Spatial Association (LISA) method is proven to be instrumental. Simply put, the LISA complements Global Moran’s I by helping us to identify statistically meaningful local clusters and spatial outliers, thus this method is also called ‘cluster and outlier analysis.’

The equation for LISA can be expressed as:

\(I_{i}\) = \(\displaystyle\frac{n(Y_{i}-\overline{Y})}{\sum_{j}(Y_{j}-\overline{Y})^{2}}\sum_{j}w_{ij}(Y_{j}-\overline{Y})\)

In the equation above, \(n\) denotes the total number of neighbors, and \(Y_{i}\) represents the value of interest in region \(i\). \(\overline{Y}\) signifies the mean of all neighbors, while \(w_{ij}\) denotes a spatial weight between feature \(i\) and \(j\) in a given neighborhood.

One of the special properties of Global Moran’s I is that it can be broken down into a series of local statistics. Put differently, the Global Moran’s I is the sum of all local Moran’s I coefficients divided by the sum of standardized weights as the equation below indicates.

\(I\) = \(\displaystyle\frac{1}{\sum_{i \neq j}w_{ij}}\sum_{i}I_{i}\)

lmoran <- localmoran(test_poly$n_colli_log, nbw, alternative = "two.sided", zero.policy = TRUE)

tmap_mode("view")

test_poly$lmI <- lmoran[,"Ii"]
test_poly$lmZ <- lmoran[,"Z.Ii"] #z-score
test_poly$lmp <- lmoran[,"Pr(z != E(Ii))"]

tm_shape(test_poly) + tm_polygons(col = "lmI", title="Local Moran's I", style="quantile", alpha=0.5, lwd=0.05) + tm_layout(legend.outside = TRUE)


Once the local Moran’s I is computed for the dataset, its resulting values can be visualized on a map, which then can help identify the local clustering patterns of similar values. In the map above, the resulting Moran’s I values range between -1.670 and 3.467. Higher Moran’s I values suggest that areas have similar neighbors in their vicinity, while lower Moran’s I statistics implies that areas have dissimilar neighbors around them. At this point, this map alone cannot reveal if such areas are clusters of high, low, or moderate values, nor does it offer any conclusive evidence on whether such patterns are statistically meaningful - an aspect which will be discussed in the following section.

tm_shape(test_poly) + tm_polygons(col = "lmp", title="p-value", breaks = c(-Inf, 0.05, Inf), alpha=0.5, lwd=0.05) + tm_layout(legend.outside = TRUE)


Similar to what was done with the Global Moran’s I method, Monte Carlo simulation method can be employed for computing pseudo p-values for LISA. However,in this case, a value of interest at region i - \(Y_{i}\) - is fixed while we shuffle all other values in the dataset - thus known as a ‘conditional randomization.’ At each permutation we can compute a local Moran’s I value, and when this process is repeated for a given number of times, a distribution of local Moran’s I can be generated under the null hypothesis of no spatial association. Subsequently, a p-value is the probability of observing more extreme Moran’s I values than the observed one for a given hexagon. If we carry out Monte Carlo simulation for all hexagons, we can compute pseudo p-values for all hexagons in the study area. In the map above, the hexagons colored with light beige are the regions where the probability of getting more extreme values than the observed by random chance are less than 5 %, thus it is safer to conclude that there are some statistically significant local clustering patterns in our study area.

Lastly, the final Local Moran’s I map is produced by employing the categorization scheme below:

  1. A hexagon is categorized as “high-high” if both \(AIScount\) and \(AIScount_{lagged}\) are above average with a significant p-value
  2. A hexagon is categorized as “low-low” if both \(AIScount\) and \(AIScount_{lagged}\) are below average with a significant p-value
  3. A hexagon is categorized as “high-low” if \(AIScount\) is above average while \(AIScount_{lagged}\) is below average with a significant p-value
  4. A hexagon is categorized as “low-high” if \(AIScount\) is below average while \(AIScount_{lagged}\) is above average with a significant p-value
  5. A hexagon is categorized as “non-significant” if a p-value is not significant
test_poly$quadrant <- NA

#high-high
test_poly[(mp$x >= 0 & mp$wx >= 0) & (test_poly$lmp <= 0.05), "quadrant"] <- 1

#low-low
test_poly[(mp$x <= 0 & mp$wx <= 0) & (test_poly$lmp <= 0.05), "quadrant"] <- 2

#high-low
test_poly[(mp$x >= 0 & mp$wx <= 0) & (test_poly$lmp <= 0.05), "quadrant"] <- 3

#low-high
test_poly[(mp$x <= 0 & mp$wx >= 0) & (test_poly$lmp <= 0.05), "quadrant"] <- 4

#non-significant
test_poly[(test_poly$lmp > 0.05), "quadrant"] <- 5


#Drawing interactive map with tm package
tm_shape(test_poly) + tm_fill(col = "quadrant", title = "", breaks = c(1,2,3,4,5,6),
                              palette = c("red", "blue", "lightpink", "skyblue2", "white"),
                              labels = c("High-High", "Low-Low", "High-Low", "Low-High", "Non-significant"), alpha=0.4) +
  tm_legend(text.size=1) + tm_borders(alpha=0.5, lwd=0.05) + tm_layout(frame=FALSE, title = "Clucters") + tm_layout(legend.outside = TRUE)


In the map above, the ‘high-high’ category (colored with red) represents areas where high values are surrounded by other high values while the ‘low-low’ category (colored with blue) indicates regions where low values are neighboring other low values. The ‘low-high’ and ‘high-low’ categories are used to identify outliers - the former highlights areas of low values with neighbors of high values and vice versa for the latter.

3D Time-Space Trajectory Analysis

#1) Extract 4 paths to be visualized
count <- Dec_01_NEW_DATA %>% group_by(MMSI) %>% count() %>% arrange(desc(n))
ais_3d <- Dec_01_NEW_DATA
ais_3d$MMSI <- as.character(ais_3d$MMSI)
tracks_3d <- ais_3d %>% filter(MMSI == "636020542_0"|MMSI == "352002128_0"|MMSI == "636012808_0"|MMSI == "432596000_0")

#2) Defininfg timestamp field as it will be used as a z-axis in the 3d visualization
tracks_3d$TIMESTAMP <- as.POSIXct(tracks_3d$TIMESTAMP, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

#3) Arranging the dataset by MMSI and TIMESTAMP
tracks_3d <- tracks_3d %>% arrange(MMSI, TIMESTAMP)

#4) Finally 3D visualization
fig_4paths <- plot_ly(tracks_3d, x = ~LONGITUDE, y = ~LATITUDE, z = ~TIMESTAMP, type = 'scatter3d', color = ~MMSI,
                      mode = 'lines') %>% add_markers(color=~MMSI, size = I(15)) %>%
  layout(scene = list(xaxis = list(title = 'Longitude'),
                      yaxis = list(title = 'Latitude'),
                      zaxis = list(title = "Time")))

fig_4paths



Generally, a space-time trajectory plot is used to visualize the movement of objects or individuals in both time and space simultaneously. This technique, by adding a time dimension to a plot, effectively visualizes movement in 3 dimensional space. For example, in the figure above, the horizontal axis represents Cartesian coordinate space while the vertical axis represent a temporal dimension. The trajectories of four individual vessels are displayed in the plot, and we can clearly observe how the locations of those vessels progress over time.

Specifically, the orange-colored trajectory in the figure above displays the movement of “432596000_0” on 2022-Dec-01. The plot shows that at 00:00:43 am this vessel departed from its initial location of (131.4158, 33.85168). After making several small and large turns over time, it reached its final coordinate of (130.1209, 32.74085) around midnight. Such rich information would not have been uncovered without the utilization of time-space trajectory plot. To put it differently, time-space plot definitely offers a more holistic representation of vessels’ movement by integrating space and time dimensions.

However, a space-time trajectory is not without limitations. For instance, if we display too many vessels simultaneously on a plot, it may obscure the very patterns we are trying to uncover. Therefore, caution must be exercised to determine the appropriate number of vessels to display on a single plot. Despite this, I believe that its advantages far outweigh its shortcomings, and its potential applications in maritime traffic control are promising. It can help us understand changes in temporal clustering patterns, peak activity times, traffic flow, congestion points, etc. I am hopeful that these valuable insights can be applied to improve service provision, planning, policy implementation, and other decision making processes.

Animation Visualization

ais_animation <- function(df, xmin=121, xmax=136, ymin=31, ymax=43) {
  
  #Calling necessary libraries
  library(dplyr)
  library(lubridate)
  library(ggplot2)
  library(gganimate)
  library(move)      #for move()
  library(moveVis)   #for align_move()
  
  #Defining a data frame and selecting necessary fields
  ais_df <- df
  tracks <- ais_df %>% dplyr::select(LONGITUDE, LATITUDE, MMSI, TIMESTAMP, SHIPNAME, SPEED, SHIPTYPE)
  
  #Casting the timestamp field from character to POSIXct
  tracks$TIMESTAMP <- parse_date_time(tracks$TIMESTAMP, "Ymd HMS")
  
  #Re-arranging tracks data by timestamp for each MMSI
  tracks_ordered <- tracks %>% arrange(MMSI, TIMESTAMP)
  
  #Deleting duplicate timestamp records within each vessel (deduplicating on
  #TIMESTAMP alone would also drop valid records belonging to other vessels)
  tracks <- tracks_ordered[!duplicated(tracks_ordered[, c("MMSI", "TIMESTAMP")]), ]
  
  #Casting the MMSI field from integer to factor
  tracks$MMSI <- as.factor(tracks$MMSI)
  
  #Filtering data - keeping ships that have more than 10 location points
  count <- tracks %>% group_by(MMSI) %>% count() 
  tracks <- inner_join(tracks, count, by = "MMSI") %>% filter(n > 10)
  
  #Creating a Large MoveStack object
  mdata <- move(x=tracks$LONGITUDE, y=tracks$LATITUDE, time=tracks$TIMESTAMP, 
                data=tracks, proj = CRS("+proj=longlat +ellps=WGS84"),
                animal=tracks$MMSI)
  
  #Aligning movement data to a uniform time scale with a uniform temporal resolution throughout the data
  m <- align_move(mdata, res=5, digit=0, unit="mins", spaceMethod = "greatcircle")
  
  #Casting the Large MoveStack object back into a data.frame
  tracks <- as.data.frame(m)
  
  #Producing an animation plot using ggplot and gganimate
  #('south' and 'north' are base-map sf layers assumed to be loaded in the global environment)
  ggplot() + geom_sf(data=south) + geom_sf(data=north) +
    geom_point(data = tracks, aes(x=x,y=y, group=trackId, colour=trackId), alpha = 0.7, shape=20) +
    theme(legend.position = 'none') + transition_time(time) + labs(title = "Time: {frame_time}") +
    shadow_mark(alpha = 0.3, size = 0.5) + xlim(xmin, xmax) + ylim(ymin, ymax)
  
}

There isn’t a one-size-fits-all solution. While a certain method may excel in emphasizing a particular aspect of a dataset, it invariably involves trade-offs. Therefore, having a wide array of tools at your disposal is advantageous, allowing for the selection of the most tailored solution for the problem at hand. In this context, I would like to introduce the method of animation visualization.

Neither 2D point maps nor space-time trajectory plots are well suited to visualizing vessels’ speed or heading. As previously discussed, the space-time trajectory approach is effective only for visualizing a small number of vessels at a time. In contrast, the animation method overcomes these limitations by enabling us to observe the movement patterns of a large number of vessels from an omniscient point of view, all at once, as they unfold across the dimension of time.

As appealing as this approach may sound, there are still a few challenges to overcome. One significant hurdle arises from the irregular time intervals at which the original AIS data are collected. Factors such as distances to nearby stations, vessel-to-vessel distances, traffic density, and other variables are known to cause this inconsistency. The problem poses a major challenge in reconstructing vessels’ continuous paths, which is a crucial preliminary step for generating an animation plot.

To address this issue, I employed a linear point interpolation technique. This method interpolates points at a regular time interval based on the known locations. After this, frames at 5-minute intervals were generated and merged to create the final animation plot. For animation effects, I utilized the ‘shadow_mark’ and ‘shadow_wake’ methods available in the gganimate library. The shadow_mark method displays all past point traces in a plot, while the shadow_wake method visualizes preceding frames with a gradual falloff. Examples of both techniques are demonstrated below.
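As a minimal sketch of the linear interpolation idea described above (separate from the align_move call used in the actual function), the following resamples one hypothetical vessel's positions onto a regular 5-minute grid with base R's approx(); the timestamps and coordinates are invented for illustration.

```r
# Hypothetical observations for a single vessel, recorded at irregular times
t_obs   <- as.POSIXct(c("2022-12-01 00:00:43", "2022-12-01 00:07:10",
                        "2022-12-01 00:18:55"), tz = "UTC")
lon_obs <- c(131.4158, 131.3900, 131.3500)
lat_obs <- c(33.85168, 33.84000, 33.82000)

# A regular 5-minute grid spanning the observed period
t_reg <- seq(min(t_obs), max(t_obs), by = "5 mins")

# approx() performs piecewise-linear interpolation along each coordinate
lon_reg <- approx(as.numeric(t_obs), lon_obs, xout = as.numeric(t_reg))$y
lat_reg <- approx(as.numeric(t_obs), lat_obs, xout = as.numeric(t_reg))$y

interp <- data.frame(time = t_reg, lon = lon_reg, lat = lat_reg)
```

Resampling every vessel onto the same grid this way ensures that each animation frame has at most one location per vessel.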

AIS animation for Korean Peninsula with the shadow_mark effect

ais_animation(Dec_01_NEW_DATA) #showing entire study area
ais_animation(Dec_01_NEW_DATA, xmin=126, xmax=132, ymin=32, ymax=36) #Showing only the south east region of South Korea
ais_animation(Dec_01_NEW_DATA, xmin=122, xmax=128, ymin=35, ymax=38) #Showing only the west region of South Korea

AIS animation for Korean Peninsula with the shadow_wake effect

AIS animation around Incheon with the shadow_mark effect

AIS animation around Busan with an enhanced color scheme

Trajectory (Line) Density Analysis

All the preceding methods fall under the heading of point pattern analysis. In contrast, the methods I am about to introduce in the following sections belong to the realm of line pattern analysis. While we have covered a significant amount already, there is more ground to explore ahead! The diagram below neatly summarizes the methods used and the logical workflow employed for trajectory density analysis.

I will begin by discussing the detailed steps involved in pre-processing the original AIS data, an essential yet time-consuming process in any data science project. Following this, I will delve into ‘points_to_paths’ analysis and feature blending effects. I will then briefly showcase the results of the radar/sonar detection range analysis. This analysis was initially included at the request of the Korean Navy to visualize cumulative changes in detection range during their operations. Additionally, I will demonstrate how to combine point and line layers to enhance map readability. Lastly, I will introduce trajectory clustering analysis, an effective method for summarizing line density patterns, along with its analysis results.

Trajectory Density Analysis Workflow

DiagrammeR::grViz("               
digraph surveillance_diagram {    # 'digraph' means 'directional graph', then the graph name 

  # graph statement
  
  graph [layout = dot,
         rankdir = LR,            # layout left-to-right
         fontsize = 10]

  # nodes (circles)
  
  node [shape = circle,           # shape = circle
       fixedsize = true
       width = 1.3]                      

  # Main tree
  Original  [label = 'Original\nPoint Data'] 
  SplitMMSI [label = 'Split by\nMMSI']
  OrderTime [label = 'Order by\ntimestamp']
  Threshold [label = 'Ti+1 - Ti >\nThreshold\n(e.g., 5 hours)',
  shape=diamond, height=1.6, width=1.6, color=blue, fontcolor=blue]
  Separate_lines [label = 'Create\nseparate lines', color=blue]
  One_line [label = 'Create\ncontinuous line', color=blue, fontcolor=blue]
  GIS [label = 'Merge\ninto\nGIS\nlinestring']
  Line_Density [label =  'Line Density\nVisualization',
  shape=square, height = 1.6, width = 1.6, color = orange]
  Detection_range [label = 'Radar\nDetection Range\nAnalysis', 
  fontcolor = darkgreen, color=darkgreen, shape=square ]
  Join_attribute [label = 'Join by\nattribute', 
  fontcolor = black, color=black]
  Feature_blending [label = 'Feature\nblending\neffects', 
  fontcolor = darkgreen, color=darkgreen, shape = square]
  Traj_Clus [label =  'Trajectory\nClustering\n\n-DBSCAN\n-HDBSCAN\n-Kmedoids',
  shape=square, height = 1.6, width = 1.6, color = orange]
  Vis_Type [label = 'Visualization\nby type', 
  shape=square, color=orange, fontcolor=orange]
  Filtering [label = 'Filtering\nby users']
  Overlay_point [label = 'Overlay\npoint layer']
  Speed [label='Visualization\nby speed', 
  shape=square, color=orange, fontcolor=orange]
  Heading[label='Visualization\nby heading',
  shape=square, color=orange, fontcolor=orange]

  # edges
  Original ->  SplitMMSI 
  SplitMMSI -> OrderTime                      
  OrderTime -> Threshold
  Threshold -> Separate_lines[label = 'yes', fontcolor=red, style=dashed, color=blue]
  Threshold -> One_line [label = 'no', fontcolor=red, style=dashed, color=blue]
  {Separate_lines One_line} -> GIS[style=dashed, color=blue]  
  GIS -> {Line_Density Traj_Clus}
  Line_Density -> {Detection_range Join_attribute Feature_blending} [style=dashed,
  color=darkgreen]
  Join_attribute -> {Filtering Vis_Type}
  Filtering -> Overlay_point
  Overlay_point -> {Speed Heading} [style=dashed, color=orange]
  Vis_Type -> {Speed Heading} [style=dashed, color=orange]
  }
")

Data Pre-processing

In data science, the importance of data cleaning cannot be overemphasized. I believe this is the single most important step in any data science project. Socrates’ age-old wisdom, ‘Know thyself,’ can be rephrased as “know thy data” in the realm of data science. This fundamental principle is, however, often neglected by many practitioners. Undoubtedly, we all come to know our data better while tidying up a given dataset.

The importance of data cleaning, I believe, increases in proportion to the size of your data, because a growing volume of data also increases the likelihood of noise. In our case, the dataset for December 1st alone has 1,440,914 rows and 24 columns, which amounts to 375MB of storage space. This could quickly add up to several hundred gigabytes or even terabytes, depending on the temporal and spatial scale of your project. Just as gold mining involves separating gold from soil and gravel, data mining involves separating signal from noise. The following steps and code describe the detailed data clean-up process employed to that end.

  1. Select necessary columns for data visualization
  2. Create a “shiptype” column and assign new values based on existing field
  3. Create an xy_coordinate field as a preliminary step to remove redundant coordinates
  4. Split the dataset into a large list using MMSI (a unique vessel identifier)
  5. Define a function called ‘f_time’
  6. Apply the f_time function to a large list using ‘lapply’ to remove redundant timestamps
  7. Define a function called ‘f_xy’
  8. Apply the f_xy function to a large list using ‘lapply’ to remove redundant coordinates
  9. Convert the large list back into a data.frame again
  10. Create HIGHER_TYPE field
  11. Create Radar and Sonar range fields
  12. Remove unnecessary “,” from the SHIPNAME field
  13. Convert data.frame into data.table
  14. Cast the TIMESTAMP field into POSIXct
  15. Arrange data.table by MMSI and TIMESTAMP
  16. Compute time-lags within each MMSI
  17. Identify points where time lags exceed a pre-defined time threshold
  18. Use a cumulative sum function to assign new_ID within each MMSI
  19. Create MMSI_NEW field by combining MMSI and new_ID
  20. Remove vessels whose total number of points is less than a specified threshold
  21. Compare the dataset before and after the processing
ais_data_cleaning <- function(df, threshold = 10, cutoff_mins = 90, layer_name) {
  
  #Calling necessary libraries
  library(dplyr)
  library(lubridate)
  
  #extracting fields that are needed for visualization
  large_data <- df %>% dplyr::select(ship_and_cargo_type, name, mmsi, timestamp, course, speed, longitude, latitude)
  
  #creating "shiptype" column
  large_data <- large_data %>%
    mutate(SHIPTYPE = case_when(ship_and_cargo_type == "70" | ship_and_cargo_type == "71" |
                                ship_and_cargo_type == "79" ~ "Cargo",
                                ship_and_cargo_type == "80" ~ "Tanker",
                                ship_and_cargo_type == "52" ~ "Tug",
                                ship_and_cargo_type == "30" ~ "Fishing",
                                ship_and_cargo_type == "60" ~ "Passenger",
                                ship_and_cargo_type == "50" ~ "Pilot",
                                .default = "Other"))
  
  #creating a xy_coords field to delete duplicate coordinates
  large_data <- large_data %>%
    mutate(lon = round(longitude, 3),
           lat = round(latitude, 3),
           xy_combined = paste(as.character(lon), ", ",
                               as.character(lat))) %>% dplyr::select(-c(lon, lat))
  
  
  #split a dataset into a large list by mmsi
  split <- split(large_data, large_data$mmsi)
  
  #a function that removes duplicate timestamps
  f_time <- function(x) x[!duplicated(x[,c("timestamp")]),]
  
  #a function that removes duplicate xy_coords
  f_xy <- function(x) x[!duplicated(x[,c("xy_combined")]),]
  
  #applying timestamp removing function
  split <- lapply(split, f_time)
  
  #applying xy_coords removing function
  split <- lapply(split, f_xy)
  
  #converting a large list back into a dataframe again
  large_data <- do.call(what="rbind", split) %>% dplyr::select(-xy_combined)
  
  #selecting necessary fields in a right order
  large_data <- large_data %>% dplyr::select(SHIPTYPE, name, mmsi, timestamp, course, speed, longitude, latitude)
  
  #Rename columns
  colnames(large_data) <- c("SHIPTYPE","SHIPNAME", "MMSI", "TIMESTAMP", "COURSE", "SPEED", "LONGITUDE", "LATITUDE")
  
  #creating "HIGHER_TYPE" & "RADAR" & "SONAR" fields
  #the following fields are created as they are needed in the model developed for the NAVY
  large_data <- large_data %>% mutate(HIGHER_TYPES = case_when(SHIPTYPE == "Cargo" ~ "FIRST",
                                                               SHIPTYPE == "Tanker" ~ "SECOND",
                                                               SHIPTYPE == "Tug" ~ "THIRD",
                                                               SHIPTYPE == "Fishing" ~ "FOURTH",
                                                               SHIPTYPE == "Passenger" ~ "FIFTH",
                                                               SHIPTYPE == "Pilot" ~ "SIXTH",
                                                               .default = "OTHER")) %>%
    mutate(RADAR = case_when(SHIPTYPE == "Cargo" ~ 15000,
                              SHIPTYPE == "Tanker" ~ 12000,
                              SHIPTYPE == "Tug" ~ 10000,
                              SHIPTYPE == "Fishing" ~ 8000,
                              SHIPTYPE == "Passenger" ~ 7000,
                              SHIPTYPE == "Pilot" ~ 6000,
                              .default = 5000)) %>%
    
    mutate(SONAR = case_when(SHIPTYPE == "Cargo" ~ 12000,
                             SHIPTYPE == "Tanker" ~ 9000,
                             SHIPTYPE == "Tug" ~ 7000,
                             SHIPTYPE == "Fishing" ~ 5000,
                             SHIPTYPE == "Passenger" ~ 4000,
                             SHIPTYPE == "Pilot" ~ 3000,
                             .default = 2000))
  
  #removing ', ' from the shipname field
  large_data$SHIPNAME <- gsub(","," ", large_data$SHIPNAME)
  
  # ChatGPT's suggestion to make previous code run faster (as of Feb/17/2024)
  library(data.table)
  
  # Convert to data.table
  setDT(large_data)

  # Convert TIMESTAMP to POSIXct
  large_data[, TIMESTAMP := as.POSIXct(TIMESTAMP)]
  
  # Sort by MMSI and TIMESTAMP
  setorder(large_data, MMSI, TIMESTAMP)
  
  # Compute time_lag
  large_data[, time_lag := TIMESTAMP - shift(TIMESTAMP, fill = first(TIMESTAMP)), by = MMSI]
  
  # Compute time_lag_exceeds_threshold
  cutoff_mins <- minutes(cutoff_mins)  # Adjust this threshold as needed
  large_data[, time_lag_exceeds_threshold := time_lag > cutoff_mins]
  
  # Compute group_id
  large_data[, group_id := cumsum(time_lag_exceeds_threshold), by = MMSI]
  
  # Compute MMSI_NEW
  large_data[, MMSI_NEW := paste0(MMSI, "_", group_id)]
  
  # Sort by MMSI_NEW and TIMESTAMP
  setorder(large_data, MMSI_NEW, TIMESTAMP)
  
  #counting number of points per MMSI
  count <- large_data %>% group_by(MMSI_NEW) %>% count()    
  
  #Inner-joining the count data to large_data, then keeping vessels whose total point count exceeds the threshold 
  large_data <- large_data %>% inner_join(count, by=join_by(MMSI_NEW)) %>%
    mutate(MMSI = MMSI_NEW) %>% filter(n > threshold) %>% 
    dplyr::select(-c(n, time_lag, time_lag_exceeds_threshold, group_id, MMSI_NEW)) %>% 
    as.data.frame()

  #writing a cleaned ais data unto Global Environment with a new name
  assign(layer_name, large_data, envir = .GlobalEnv)
  
  #Compare before and after
  comparison <- function(data1, data2) {
    
    before <- dim(data1)[1]
    after <- dim(data2)[1]
    
    diff <- before - after
    
    cat("the total row number of the input dataset is", dim(data1)[1], '\n')
    cat("the total row number of the output dataset is", dim(data2)[1], '\n')
    cat("the difference between the two dataset is", diff)
  }
  
  comparison(df, large_data)
  
}

The following is the result of processing the December 1st dataset.

  • the total number of rows for Dec 1st dataset is 1,440,914
  • the total number of rows after pre-processing is 397,122
  • the difference between the two is 1,043,792

The following is the result of processing the December 2nd dataset.

  • the total number of rows for Dec 2nd dataset is 1,452,537
  • the total number of rows after pre-processing is 389,281
  • the difference between the two is 1,063,256

The results powerfully demonstrate the importance of the data clean-up process. We found that a large portion of our datasets was redundant or consisted of noise, rendering it unnecessary. This clean-up significantly streamlines the datasets, preparing them effectively for further analysis. It’s undeniable: processing 389,281 rows as opposed to 1,452,537 can make a significant difference in data analysis.

Points to Paths Analysis with Feature Blending Effects

points_to_paths <- function(pnt_data_frame, threshold, layer_name){
  
  #Calling necessary libraries
  library(dplyr)
  library(sf)
  
  #filtering the input data so each vessel has more points than the threshold
  mmsi_count <- pnt_data_frame %>% group_by(MMSI) %>% count()
  
  ais_df <- inner_join(pnt_data_frame, mmsi_count, by = "MMSI")
  
  ais_df <- ais_df %>% filter(n > threshold) #here is threshold parameter
  
  #creating sf point object
  pnt <- st_as_sf(ais_df, coords = c("LONGITUDE", "LATITUDE"), crs = 4326)
  
  #splitting the above 'pnt' layers by mmsi (each ship)
  lines <- split(pnt, pnt$MMSI)
  
  #timestamp ordering function
  f <- function(x) x[order(x$TIMESTAMP),]
  
  #apply time ordering function to all the list in line object
  lines <- lapply(lines, f)
  
  #next step is to combine all point geometries to a single multipoint geometry.
  #using "st_combine" to do that
  
  lines <- lapply(lines, st_combine)
  
  #casting multipoint geometry into linestring object
  lines <- lapply(lines, st_cast, to = "LINESTRING")
  
  # At this stage we have a list of individual "linestring" geometries, one for each vessel. The list can be combined back into an sfc geometry column using do.call
  geom <- do.call(c, lines)
  
  #transforming the geom object into an sf object with vessel ids
  layer_lines <- st_sf(id = names(lines), geometry = geom)
  
  # Assigning the result to a variable in the global environment
  layer_lines_global <- layer_lines
  assign(layer_name, layer_lines_global, envir = .GlobalEnv)
  
  #drawing the output on a tm_map
  #library(mapview)
  #mapview(layer_lines, lwd = 0.1, legend=FALSE)
}

points_to_paths(Dec_01_NEW_DATA, 20, "linestring_1")


Once the initial data clean-up process is completed, one can effortlessly convert existing points into paths by simply casting point data into linestrings, a process that involves linking data points in sequence to form continuous lines. In addition, the vessel type field is used to color-code these linestrings, thereby enhancing visualization. One of the most notable advantages of this approach is its ability to significantly reduce data size. For example, consider the data storage savings: the original dataset for December 1st alone contained 1,440,914 rows and 24 columns, occupying 375MB of storage space. However, following the ‘pre-processing’ and ‘points_to_paths’ conversion, the dataset was reduced to just 1,922 features, requiring merely 4.3MB of storage space. This substantial reduction not only facilitates faster data processing speeds but also results in crisper visualizations and more intuitive interpretations, as evidenced in the figure above.

In addition, I have employed feature blending effects in order to accentuate high-density areas where many lines overlap. This technique is instrumental in uncovering hidden patterns and insights, allowing for a more nuanced interpretation. The list of supported options includes, but is not limited to, ‘add’, ‘multiply’, ‘screen’, ‘overlay’, and ‘color burn’. A picture is truly worth a thousand words: the figure below illustrates the dramatic contrast when applying ‘multiply’ and ‘add’ effects to the original linestring features. As can be seen, while the multiply effect tends to darken areas of overlap, the add effect creates a lighter and brighter visual impression. Those who are interested in the complete list of available effects and their corresponding descriptions can refer to the following link - https://doc.arcgis.com/en/arcgis-online/create-maps/use-blend-modes-mv.htm

point_line_blending_effect <- function(df, title = "AIS Vis", linewidth = 0.05, color="purple" ,blend = "add") {
  
  library(ggplot2)
  library(sf)
  library(ggblend)
  south <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "SouthKorea")
  north <- st_read('/Users/dongheekoh/Documents/Data Science Training/portfolio/projects/AIS_visualization',
                   quiet=TRUE, layer = "NorthKorea")
  
  ggplot(data=df) + geom_sf(data=south) + geom_sf(data=north) +
    geom_path(linewidth=linewidth, color=color,
              mapping=aes(x=LONGITUDE, y=LATITUDE, group=MMSI)) * blend(blend=blend) +
    labs(title = title) + xlab(NULL) + ylab(NULL)
}


#showcasing feature blending effects 
blending_path_multiply <- point_line_blending_effect(Dec_01_NEW_DATA, blend="multiply", title="Multiply", color="darkorange")
blending_path_add <- point_line_blending_effect(Dec_01_NEW_DATA, blend="add", title = "Add", color="darkorange")
patchwork(blending_path_multiply, blending_path_add, ncol=2)

Radar/Sonar Detection Range Analysis

Navy vessels, in peacetime, are primarily used for patrolling purposes. They are equipped with radars and sonars to detect potential enemy attacks and threats. Therefore, it is imperative for them to cover as much area as possible during their operations. In this context, the Korean Navy requested us to develop a tool that can visualize the detection range of vessels and calculate the total coverage area for each operation. To achieve this, we applied buffer ranges to the linestring layer for selected vessels. Since actual detection ranges are classified military secrets, we assigned arbitrary detection ranges to different vessel categories for demonstration purposes only. The resulting visualization and total coverage calculation are displayed in the following figure. The plot on the left shows the state before dissolving overlapping features, whereas the plot on the right displays the state after all overlapping features have been dissolved. The total coverage area is based on the dissolved features.

buffer_union_analysis <- function(df, dist_or_type=NULL, buffer_dist=20){
  
  if(!is.numeric(dist_or_type) & !is.character(dist_or_type)) {
    
    print("Please input 'dist_or_type' parameter")
    
  } else {
    
    #1) Casting MMSI into character
    df$MMSI <- as.character(df$MMSI)
    
    #2) Extracting unique MMSIs from an original data
    inter <- df %>% distinct(MMSI, .keep_all = TRUE)
    
    #3) Points to paths
    points_to_paths(df, threshold = 50, layer_name = "linestring")
    
    #4) Joining the linestring layer with attribute data
    join <- linestring %>% left_join(inter, by = c("id" = "MMSI"))
    
    #5) Re-projecting layers so as to have a meter as its unit
    join <- join %>% st_set_crs(4326) %>% st_transform(3857)
    
    #6) Calculating the lengths of trajectories using st_length
    join <- join %>% mutate(distance = as.numeric(st_length(.))) %>% arrange(desc(distance))
    
    
    if(is.numeric(dist_or_type)) {
      
      trajs <- join %>% filter(distance > dist_or_type)
      trajs_buffer <- st_buffer(trajs, buffer_dist*1000)
      
      #union and calculate area
      trajs_buffer_union <- trajs_buffer %>% st_union() %>% st_as_sf() %>% mutate(areas = as.numeric(st_area(.))/1000000)
      
      #draw both buffer and union map side by side
      
      buffer_plot <- ggplot() + geom_sf(data=south) + geom_sf(data=north) + geom_sf(data=trajs_buffer, fill="pink", alpha=0.5)
      union_plot <- ggplot() + geom_sf(data=south) + geom_sf(data=north) + geom_sf(data=trajs_buffer_union, fill="pink", alpha=0.5)
      
      patchwork(buffer_plot, union_plot,
                title=paste("the total area covered by chosen vessels is", round(trajs_buffer_union$areas[1],2), 'km\u00B2'))
      
      
    } else if(is.character(dist_or_type)) {
      
      ship_type <- join %>% filter(SHIPTYPE == dist_or_type)
      ship_type_buffer <- st_buffer(ship_type, buffer_dist*1000)
      
      #union and calculate area
      ship_type_buffer_union <- ship_type_buffer %>% st_union() %>% st_as_sf() %>%mutate(areas = as.numeric(st_area(.))/1000000)
      
      #draw both buffer and union map side by side
      
      buffer_plot <- ggplot() + geom_sf(data=south) + geom_sf(data=north) + geom_sf(data=ship_type_buffer, fill="pink", alpha=0.5)
      union_plot <- ggplot() + geom_sf(data=south) + geom_sf(data=north) + geom_sf(data=ship_type_buffer_union, fill="pink", alpha=0.5)
      
      patchwork(buffer_plot, union_plot,
                title=paste("the total area covered by chosen vessels is", round(ship_type_buffer_union$areas[1],2), 'km\u00B2'))
    }
    
  }
  
}

buffer_union_analysis(Dec_01_NEW_DATA, dist_or_type = "Tug")

Combining Trajectories and Points

The figure below showcases how point and linestring visualization methods can be synthesized. Specifically, the ‘speed’ and ‘course’ fields were utilized to enhance map readability further. On the map, a triangle represents the course of vessels, while color gradation visualizes their speed.
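As a rough sketch of this combined encoding (the field names follow the cleaned dataset, but the values and styling here are illustrative, not the production code), geom_spoke from ggplot2 can draw a short directional mark for each vessel's course while colour encodes speed:

```r
library(ggplot2)

# Toy points with COURSE (degrees clockwise from north) and SPEED (knots)
pts <- data.frame(
  LONGITUDE = c(129.0, 129.2, 129.4),
  LATITUDE  = c(35.1, 35.2, 35.3),
  COURSE    = c(45, 90, 180),
  SPEED     = c(5, 12, 20)
)

p <- ggplot(pts, aes(x = LONGITUDE, y = LATITUDE)) +
  # convert compass course to a mathematical angle (counter-clockwise from east)
  geom_spoke(aes(angle = (90 - COURSE) * pi / 180, colour = SPEED),
             radius = 0.05, arrow = arrow(length = unit(0.1, "cm"))) +
  geom_point(aes(colour = SPEED), size = 1) +
  scale_colour_gradient(low = "steelblue", high = "red")
```

In the application itself a triangle marker plays the role of the arrow, but the mapping of course to angle and speed to colour is the same idea.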

Trajectory Clustering Analysis

In data mining, clustering analysis is commonly used to uncover patterns and insights from unlabeled datasets. Specifically, clustering analysis attempts to classify features into distinct clusters based on the inherent similarities within a dataset. One unique aspect of our analysis is that we have trajectories (not points) as the input for clustering. A trajectory consists of a series of locations generated by a vessel, so it can clearly show the movement path of an individual vessel. Therefore, the goal of trajectory clustering in this project was to neatly summarize the AIS dataset by identifying groups of trajectories that display similar movement characteristics, as defined by the shapes of the trajectories. In this section, the methodological details of each method are omitted, as they are beyond the scope of this article. Instead, I will briefly discuss what the DBSCAN, HDBSCAN, and k-medoids clustering methods are and present the results from each.

DBSCAN Results

DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, aims to distinguish “clusters” from “noise” in a dataset. It does so by utilizing two key parameters: a distance threshold (\(\epsilon\)) and the minimum number of points required to form a cluster (\(MinPts\)). Once these parameters are defined, DBSCAN divides points into three categories: core, border, and noise. A point is labeled ‘core’ if it has at least \(MinPts\) points within its \(\epsilon\)-neighborhood. ‘Border’ points are those that do not meet the \(MinPts\) criterion themselves but lie within the \(\epsilon\) distance of a core point. Points that are neither core nor border are classified as ‘noise.’ DBSCAN then merges density-connected clusters that are adjacent to each other, forming larger clusters. This iterative process continues until every point in the dataset has been evaluated. In DBSCAN analysis, the selection of the \(\epsilon\) and \(MinPts\) parameters is crucial, as they significantly influence the results. The following figure showcases the results from the DBSCAN clustering analysis.
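For readers unfamiliar with these mechanics, here is a minimal toy example of DBSCAN on plain 2D points using the dbscan package; the trajectory analysis itself runs on trajectory similarities, and the parameter values below are arbitrary.

```r
library(dbscan)

set.seed(42)
# two dense blobs plus scattered background noise
blob1 <- matrix(rnorm(100, mean = 0, sd = 0.2), ncol = 2)
blob2 <- matrix(rnorm(100, mean = 3, sd = 0.2), ncol = 2)
noise <- matrix(runif(20, min = -2, max = 5), ncol = 2)
xy <- rbind(blob1, blob2, noise)

# eps is the epsilon neighbourhood radius; minPts the core-point threshold
db <- dbscan::dbscan(xy, eps = 0.5, minPts = 5)

# label 0 marks noise; labels 1..k are the recovered dense clusters
table(db$cluster)
```

Shrinking eps or raising minPts pushes more points into the noise category, which is exactly the sensitivity to parameter choice mentioned above.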

HDBSCAN Results

While the traditional DBSCAN approach is capable of clustering trajectories of any shape, as illustrated in the plot above, it has trouble identifying clusters when datasets display uneven density distributions. Advancements in clustering methods effectively address this limitation. Building upon the DBSCAN approach, HDBSCAN incorporates a hierarchical clustering strategy that removes the need to fix a single epsilon value, instead extracting the most stable clusters across all density levels. Moreover, HDBSCAN simplifies the cluster identification process through a condensed dendrogram plot. The following graphs show the results from the HDBSCAN analysis.
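Continuing the toy-point illustration (again a sketch on synthetic data, not the trajectory inputs), HDBSCAN needs only a minimum cluster size; the epsilon-like scale is handled internally by the cluster-stability hierarchy:

```r
library(dbscan)

set.seed(1)
# one tight blob and one diffuse blob: differing densities trip plain DBSCAN
tight  <- matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2)
spread <- matrix(rnorm(100, mean = 4, sd = 0.8), ncol = 2)
xy <- rbind(tight, spread)

# no eps to choose: only the minimum cluster size is supplied
hdb <- dbscan::hdbscan(xy, minPts = 10)
hdb$cluster        # cluster labels (0 = noise)
# plot(hdb)        # condensed cluster tree (dendrogram-style)
```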

K-medoids Results

K-medoids is the last method employed in the trajectory clustering analysis. Unlike the two preceding methods, the researcher needs to predetermine the parameter K (i.e., the number of clusters). One can determine this either by visual inspection of the dataset or by examining a scree plot to identify the point beyond which the sum of squared errors (SSE) no longer decreases meaningfully. Once the value of K is determined, the algorithm forms clusters while minimizing the distance between the trajectories in a cluster and that cluster’s medoid, a representative trajectory drawn from the data itself. In each iteration, medoids and cluster assignments are updated, and this continues until the algorithm converges. The figure below displays the results of the K-medoids clustering.
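A minimal k-medoids sketch with pam() from the cluster package is shown below, again on toy points; for trajectories, pam() can equally be fed a precomputed trajectory-dissimilarity matrix instead of raw coordinates.

```r
library(cluster)

set.seed(7)
# two synthetic groups of 30 points each
xy <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
            matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))

# k must be chosen in advance; pam() is the classic k-medoids algorithm
km <- cluster::pam(xy, k = 2)

km$medoids      # the two representative (actual) observations
km$clustering   # cluster assignment for every point
```

Because medoids are actual observations rather than averaged centroids, each cluster is summarized by a real trajectory, which makes the result easy to interpret on a map.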

As demonstrated, trajectory clustering analysis unquestionably reveals meaningful patterns that would otherwise remain invisible to the unaided eye. However, no single method proves superior across all scenarios. Therefore, one needs to choose the most suitable approach depending on the characteristics of the dataset, such as its density and distribution patterns. This selection process demands considerable time and effort, as even small adjustments to tuning parameters may lead to substantial changes in the analysis outcomes. Since the primary aim of this article has been to showcase various visualization techniques for AIS data mining, I have not covered the specifics of identifying optimal tuning parameters for each method. However, I am confident that with a more thorough scientific investigation, clustering analysis will be able to uncover even deeper insights.

Project Summary

This brings us to the end of this article. As stated at the outset, I have comprehensively discussed all the methods employed in developing the AIS visualization application. The front-end user interface was developed in Qt C++ by a developer, and the final product was delivered to the Korean Navy in November 2023. As this article has illustrated, analyzing AIS data can uncover insightful patterns and enhance our understanding of maritime traffic like never before. The realm of AIS research is still rapidly expanding, offering numerous opportunities for further exploration. My future research, if time permits, will focus mainly on leveraging AIS data for trajectory prediction, anomaly detection, navigation, and surveillance, among other applications.